STA5073Z Assignment 2

Author

Susana Maganga, Ndivhuwo Nyase and Puja Pande

1 Introduction

A State of the Nation Address (SONA) is a speech given by the President of South Africa in which the president reports on the status of the nation. It functions as an annual report that covers political and socio-economic topics. These include, but are not limited to, the nation’s budget, economy, news, agenda, progress and achievements, as well as the president’s priorities and legislative proposals. The SONA gives the public an idea of the broader political landscape and of the socio-economic issues and challenges facing the country. At the same time, it reflects on the achievements, growth and progress the country has made in the past year. Furthermore, it gives the public direction and an understanding of the country’s plans.

In this light, a comprehensive analysis of SONA speeches from 1994 to 2023 can provide an extensive perspective on the struggles and triumphs over the course of South Africa’s history. In this assignment, we aim to provide that perspective using a computational approach. We utilise sentiment analysis and topic modelling, specifically Latent Dirichlet Allocation (LDA), to quantify the emotional tone of each president and to recognise recurring topics that provide context for, and an understanding of, the broader socio-political and economic environment of South Africa.

Sentiment analysis refers to the use of natural language processing (NLP), text analysis and computational linguistics to identify, extract, quantify and study affective states and subjective information. Topic modelling, on the other hand, is an unsupervised statistical learning technique for discovering topics within a collection of texts or documents. It is commonly used to classify documents and to uncover hidden semantic structure in a corpus. In a related study, Miranda and Bringula (2021) explored sentiment and topics in the SONAs of Philippine presidents. The study showed that the SONAs generally expressed positive sentiments, while the lowest negative sentiment was during the martial law period in 1974. The authors were also able to identify which concerns the presidents focused on in their speeches. Another related study mined tourists’ perceptions of Indonesian tourism destinations and found that joy was the most prominent emotion accompanying visitors’ experiences (Irawan et al., 2019). This related work shows that researchers have obtained meaningful results when evaluating the emotive content of text documents or extracting common themes and topics from them.

2 Data and Methods

2.1 Data Collection

As the aim of this project was to analyse previous SONA speeches delivered by the presidents of South Africa, the data used were SONA speeches from 1994 through to 2023. These were accessed from the South African Government website (https://www.gov.za/state-nation-address) and include speeches from six presidents: FW de Klerk, Nelson Mandela, Thabo Mbeki, Kgalema Motlanthe, Jacob Zuma and the current president, Cyril Ramaphosa. Overall, 36 speeches were analysed. The sections to follow detail how these speeches were analysed.

2.2 Data Cleaning and Preprocessing

In order to analyse and model the SONA speeches, they first had to be cleaned and preprocessed. The .txt files were read into R and combined into a single dataframe. The date and year in which each speech was delivered, as well as the president delivering it, were extracted and added as columns to the dataframe. In addition, the dates were all converted to the same format, all punctuation was removed and the words in the speeches were converted to lowercase. The text was also cleaned to remove the dates that appear at or near the beginning of the speeches. Lastly, all numbers were removed for the LDA analysis.

The speeches were then converted to a ‘tidy’ format, in which every variable is a column, every observation is a row and every cell holds a single value. As part of this conversion the speeches were tokenised, that is, each text string (here, each individual speech) was decomposed into smaller chunks, which in this case were words. Essentially, the speeches were broken down so that each observation consisted of exactly one word. Stop words, i.e. words that occur very commonly, were then removed from the ‘tidy’ dataset so that they would not add noise to the analysis.
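The tokenisation and stop word removal described above can be sketched in base R as follows. This is a minimal illustration and not the code used for this report: the `speeches` data frame, its column names and the short `stop_words` vector are hypothetical stand-ins, and a real analysis would use a full stop word list.

```r
# Toy stand-in for the speech data frame; the column names are assumptions.
speeches <- data.frame(
  president = c("Mandela", "Mbeki"),
  text = c("The people of South Africa have spoken.",
           "We will continue to build the economy of the country."),
  stringsAsFactors = FALSE
)

# Tiny illustrative stop word list (hypothetical, far from complete).
stop_words <- c("the", "of", "have", "we", "will", "to")

# Lowercase, strip punctuation and numbers, then split into words.
tokenise <- function(text) {
  text <- tolower(text)
  text <- gsub("[[:punct:]]", " ", text)
  text <- gsub("[[:digit:]]+", " ", text)
  words <- strsplit(text, "\\s+")[[1]]
  words[words != ""]
}

# One row per word, keeping the president as an identifying column.
tidy_list <- lapply(seq_len(nrow(speeches)), function(i) {
  data.frame(president = speeches$president[i],
             word = tokenise(speeches$text[i]),
             stringsAsFactors = FALSE)
})
tidy_words <- do.call(rbind, tidy_list)

# Remove stop words so that they do not add noise to later counts.
tidy_clean <- tidy_words[!tidy_words$word %in% stop_words, ]
print(tidy_clean)
```

Each row of `tidy_clean` holds a single word together with the president who spoke it, which is the shape that the later counting and lexicon look-ups rely on.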

After the data had been cleaned and preprocessed, exploratory data analysis was conducted, and sentiment analysis and LDA were carried out on the speeches and words.

2.3 Exploratory Data Analysis

Before conducting any in-depth analysis through sentiment analysis and LDA, exploratory data analysis (EDA) was performed in order to gain a better understanding of the data.

First, the number of words per president was counted. Mbeki had the most words with 5622, followed by Zuma with 5258, Ramaphosa with 4753, Mandela with 4311 and finally Motlanthe and de Klerk with 1631 and 431, respectively. The lower number of words for Motlanthe can be attributed to the fact that he was president for only roughly six months and hence delivered only one SONA. For de Klerk, the low count may be because the period covered begins at the end of the Apartheid regime, of which he was the last president. Furthermore, the ten most used words across all speeches were: government, south, people, country, national, development, africa, public, economic and ensure, all of which one would expect to hear given the nature of the SONA speeches.
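As an illustration of how such counts can be obtained from one-word-per-row data, the sketch below uses base R's `table()`. The `tidy_words` data frame here is a small made-up example, not the SONA data, so the numbers bear no relation to those reported above.

```r
# Made-up tidy data: one row per word (not the real SONA data).
tidy_words <- data.frame(
  president = c("Mandela", "Mandela", "Mandela", "Mbeki", "Mbeki", "Zuma"),
  word = c("government", "people", "government",
           "economic", "government", "people"),
  stringsAsFactors = FALSE
)

# Number of words per president, largest first.
words_per_president <- sort(table(tidy_words$president), decreasing = TRUE)
print(words_per_president)

# Most used words across all presidents (top 2 here, top 10 in the report).
top_words <- head(sort(table(tidy_words$word), decreasing = TRUE), 2)
print(top_words)
```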

Lastly, the top 15 words spoken by each president are summarised in the figures below.

15 Most Spoken Words Per President Including Common Words

The first figure summarises the top 15 words spoken by each president, while the second figure shows the same with the following words excluded: government, people, south, africa, african.

15 Most Spoken Words Per President Not Including Common Words

These excluded words are widely used by most presidents and may not be valuable in indicating trends in word usage. The second figure changes significantly after the removal of the common words. It can be seen that the words associated with de Klerk are focused on freedom and the transition of South Africa from an Apartheid state to a democratic one, through words such as ‘constitution’, ‘freedom’ and ‘transitional’. Mandela’s most used words, ‘development’ and ‘society’, may be attributed to the re-development of South Africa as a democratic state, but words such as ‘crime’ draw attention to wider social issues. Mbeki’s words seem to be more related to the economy and development, while Motlanthe’s address similar issues with the addition of poverty. Ramaphosa’s words are centred even more on economic development and businesses, which may be due to the effects of the Covid-19 pandemic during his time as president. Lastly, Zuma’s words follow a similar economic and development trend to the rest.

2.4 Sentiment Analysis

Sentiment analysis is a text mining technique that aims to extract the thoughts and feelings expressed in a text and determine their polarity, i.e. positive, negative or neutral. Three approaches are available for conducting sentiment analysis: supervised, lexicon-based and hybrid. The supervised method uses machine learning algorithms to train a classifier. It outperforms the lexicon-based method, but it requires a substantial amount of labelled data. Lexicon-based sentiment analysis uses sentiment lexicons (dictionaries) to assign polarity. This method is more computationally efficient, but the results may vary depending on the lexicon and domain. A word may be subjective or objective depending on the context: in the phrase “crude oil”, the word crude is used objectively, whereas in “crude language” it is subjective and carries a negative sentiment. Dealing with negation and sarcasm is also a challenge with this approach. The hybrid method is an amalgamation of the supervised and lexicon-based methods (Sadia et al., 2018).

2.5 Latent Dirichlet Allocation

LDA is a popular topic model which allows one to better understand hidden themes in a collection of documents, classify the documents into these themes and summarise them (Kulshrestha, R., 2020). In an LDA model, each document is made up of a mixture of topics and each topic is made up of a mixture of words. The documents are supplied to the model as a ‘DocumentTermMatrix’, which is obtained by converting the ‘tidy’ data.

In an LDA model, one of the hyperparameters is the number of topics, \(K\). Each topic in a document is drawn from a distribution that assigns a different probability to each of the \(K\) topics. So if \(z_{km}\) is the \(k\)th topic in the \(m\)th document, it takes a value between 1 and \(K\) (Zhang, Z., 2018):

\[ z_{km} \sim Multinomial(\theta_{m}) \]

where \(\theta_m = (\theta_{m1}, \theta_{m2}, ..., \theta_{mK})'\) is the vector of topic probabilities for the \(m\)th document.

Once a topic has been decided upon, words are organised around it. So if \(w_{mn}\) is the \(n\)th word used in the \(m\)th document, it would take a value between 1 and \(V\) with \(V\) being the total number of unique words used in all the documents. The equation below shows how a word is generated:

\[ w_{mn}|z_{km} \sim Multinomial(\beta_k) \]

where \(\beta_k = (\beta_{k1}, \beta_{k2}, ..., \beta_{kV})'\) is the vector of probabilities with which each word is picked given that topic \(k\) is selected.

Lastly, it is important to get \(\theta\) and \(\beta\) values as these represent the distribution of topics for a particular document and the distribution of words within each topic. The assumption is that both are generated from a Dirichlet distribution. For topic probability:

\[ \theta_m \sim Dirichlet(\alpha) \]

And for word probability:

\[ \beta_k \sim Dirichlet(\delta) \]
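To make the generative story concrete, the following base R sketch simulates a tiny corpus from the model above. It is an illustration under stated assumptions rather than the fitting procedure used in this project: the sizes `K`, `V`, `M` and `N`, the hyperparameter values and the helper `rdirichlet1` are all hypothetical, and one topic is drawn per word position, which is the usual reading of the model.

```r
set.seed(1)

K <- 3                 # number of topics (hypothetical)
V <- 6                 # number of unique words (hypothetical)
M <- 2                 # number of documents (hypothetical)
N <- 8                 # words per document (hypothetical)
alpha <- rep(0.5, K)   # Dirichlet parameter for topic probabilities
delta <- rep(0.5, V)   # Dirichlet parameter for word probabilities

# One Dirichlet draw, obtained by normalising independent gamma draws.
rdirichlet1 <- function(conc) {
  g <- rgamma(length(conc), shape = conc, rate = 1)
  g / sum(g)
}

# Row k holds beta_k ~ Dirichlet(delta), the word probabilities of topic k.
beta_mat <- t(sapply(seq_len(K), function(k) rdirichlet1(delta)))

docs <- lapply(seq_len(M), function(m) {
  # theta_m ~ Dirichlet(alpha): topic probabilities of document m.
  theta_m <- rdirichlet1(alpha)
  # Topic for each word position, drawn from Multinomial(theta_m).
  z <- sample(seq_len(K), N, replace = TRUE, prob = theta_m)
  # Word for each position, drawn from Multinomial(beta_k) for its topic k.
  w <- sapply(z, function(k) sample(seq_len(V), 1, prob = beta_mat[k, ]))
  list(theta = theta_m, topics = z, words = w)
})

str(docs[[1]])
```

Fitting an LDA model reverses this process: given only the observed words, it estimates the \(\theta\) and \(\beta\) values that most plausibly generated them.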

In the context of this project, the ‘tidy’ data was first converted to a ‘DocumentTermMatrix’, to which the LDA model was applied. Before fitting, the 20 most used words overall were removed from the dataset as a means of pruning the data. A grid search for the best number of topics \(k\) was conducted using coherence as the metric. This measures the relative distance between words within a topic, and it suggested that \(k = 17\) be used. However, upon inspection of these topics, there was much overlap. The next-highest coherence values were all for values of \(k\) greater than 17. Finally, \(k = 9\) was chosen, as the topics were clear and there was little overlap. An LDA model was fitted to the overall data as well as to each president’s speeches. For each individual president, a value of \(k = 2\) was chosen.

2.6 Use of Large Language Models

While not an aim of this project, it was a requirement that we experiment with a large language model such as ChatGPT to assist with the assignment. While basic search engines are helpful for coding and writing, ChatGPT is more efficient at providing specific answers to some questions, as it can take textual context into account, interact conversationally, and remember and refer back to previous questions.

In terms of coding, ChatGPT was useful in assisting with code for various function parameters as well as the syntax for certain elements of this document, e.g. the equations. However, it sometimes produced code that was incorrect or that used functions not belonging to the packages specified. In such cases, the error produced in R was fed back into the conversation and ChatGPT was almost always able to provide the correct code, albeit sometimes after many tries. The inputs into the conversation, however, had to be very specific in order to provide the correct context and to avoid unnecessary back-and-forth on simple questions. One big plus was ChatGPT’s ability to remember previous questions in a conversation, so that once the context had been specified, it could draw on earlier questions to answer later ones. Overall, from a coding point of view, this large language model performed decently, but a lot of prompting was often required.

An example of using ChatGPT to generate code was a request for a function that extracts sentences containing specific bigrams. Before much context was provided, the response was a list of steps for doing this manually in a programming language. When prompted for code, ChatGPT assumed that Python was being used. To provide context, the programming language and the structure of the tidy data were then supplied. R code was subsequently returned and it worked exactly as expected. Comments were also provided to guide the user.

# Define the specific bigram
specific_bigram <- "specific bigram"  # Replace with your desired bigram

# Initialize a list to store sentences containing the bigram
sentences_with_bigram <- character(0)

# Search for the bigram in each sentence
for (sentence in sona_sentences$sentence) {
  if (grepl(specific_bigram, sentence, ignore.case = TRUE)) {
    sentences_with_bigram <- c(sentences_with_bigram, sentence)
  }
}

# Print or manipulate the extracted sentences
print(sentences_with_bigram)

In terms of understanding specific terminology and the models themselves, ChatGPT was able to provide decent summaries which, in conjunction with other resources, helped with understanding these models. Lastly, in terms of writing, ChatGPT was a useful tool for simplifying language, paraphrasing sentences and providing succinct alternatives to inputted sentences.

3 Results and Discussion

3.1 Sentiment Analysis

This section discusses the results obtained from the lexicon-based approach to determining the outlook of the South African presidents. The bing and nrc lexicons were explored. Where a word was not in the lexicon, its label defaulted to “neutral”. The bing lexicon was developed by Minqing Hu and Bing Liu as the Opinion Lexicon. It comprises 6786 words, of which 2005 are “positive” and 4781 are “negative”.

The nrc lexicon was compiled by Saif Mohammad and Peter Turney through crowdsourcing on Amazon Mechanical Turk. The lexicon has 13872 words and, in addition to positive and negative, incorporates further sentiments: anger, anticipation, disgust, fear, joy, sadness, surprise and trust.

Words in the nrc lexicon
Category No. of Words
anger 1245
anticipation 837
disgust 1056
fear 1474
joy 687
negative 3316
positive 2308
sadness 1187
surprise 532
trust 1230

3.1.1 Word-Level Analysis

In this subsection, a word-level analysis was conducted. The top 20 positive and negative words in the speeches as a whole, as well as a breakdown per president using the bing lexicon are presented below.

Top 20 positive words used in speeches based on the bing lexicon

Top 20 positive words by president based on the bing lexicon

Overall, ‘improve’ was the most used positive word across all of the speeches. This is also true when the words are grouped by president, except for de Klerk, whose most used positive word was ‘freedom’. Other recurring words include ‘progress’ and ‘peace’/‘peaceful’.

Top 20 negative words used in speeches based on the bing lexicon

Top 20 negative words by president based on the bing lexicon

‘Poverty’, ‘crime’ and ‘corruption’ were the most commonly used negative words. They also appeared in the breakdown by president, except for de Klerk, who did not have any of these three words in his top ten negative words. de Klerk commonly used the words ‘violent’, ‘illegal’ and ‘discrimination’.

3.1.2 Sentiment-Level Analysis

The general feeling of the speeches, broken down by sentiment, was examined for each president using both the nrc and bing lexicons. The neutral words were excluded, as they dominate the other sentiments.
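The labelling and proportion calculation can be sketched in base R as below. This is a minimal sketch, assuming a one-word-per-row data frame; the four-word `lexicon` is a toy stand-in for the bing lexicon and the `tidy_words` rows are invented, so the output does not reproduce the figures in this report.

```r
# Toy lexicon standing in for bing; the entries are illustrative only.
lexicon <- data.frame(
  word = c("improve", "progress", "poverty", "crime"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Invented one-word-per-row data (not the real SONA data).
tidy_words <- data.frame(
  president = c("Mandela", "Mandela", "Mandela",
                "Zuma", "Zuma", "Zuma", "Zuma"),
  word = c("improve", "crime", "nation",
           "progress", "improve", "poverty", "public"),
  stringsAsFactors = FALSE
)

# Look each word up in the lexicon; words not found default to "neutral".
idx <- match(tidy_words$word, lexicon$word)
tidy_words$sentiment <- ifelse(is.na(idx), "neutral", lexicon$sentiment[idx])

# Exclude neutral words, then compute proportions within each president.
scored <- tidy_words[tidy_words$sentiment != "neutral", ]
counts <- table(scored$president, scored$sentiment)
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

Using `margin = 1` makes each president's row sum to one, so presidents with very different word counts can be compared on the same scale.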

Proportion of positive to negative words by president based on the bing lexicon

The proportions of positive to negative words do not seem to differ by a large amount. However, the plots indicate that Zuma, Mbeki and Mandela expressed relatively more positive sentiments in comparison to the other presidents. To further delve into the presidents’ attitudes, the more comprehensive nrc lexicon was applied.

Proportion of words per sentiment by president based on the nrc lexicon

From the plots, it can be observed that, besides negative and positive sentiments, prevalent themes include trust and anticipation. To gain some insight into the words specifically associated with these themes, the top 2 words associated with trust or anticipation for each president are presented in the table below.

Top 2 words by president associated with either anticipation or trust according to the nrc lexicon
President Word nrc bing
deKlerk constitutional trust neutral
deKlerk freedom trust positive
Mandela public anticipation neutral
Mandela nation trust neutral
Mbeki public anticipation neutral
Mbeki continue anticipation neutral
Mbeki continue trust neutral
Motlanthe public anticipation neutral
Motlanthe system trust neutral
Ramaphosa economy trust neutral
Ramaphosa public anticipation neutral
Zuma continue anticipation neutral
Zuma continue trust neutral

Most of the words associated with trust or anticipation in the nrc lexicon are categorised as “neutral” in the bing lexicon. This highlights the variability that different dictionaries can yield in an analysis. Additionally, it is important to note that due to the multi-level nature of words in the nrc lexicon, some words are associated with more than one sentiment, such as continue.

Changes in sentiment over time were also investigated. Line plots with corresponding smoothed lines were generated to assess any shifts in positive or negative sentiment.

To assess whether the shifts are statistically significant, a logistic regression model was employed. Logistic regression is a regression-based modelling technique for a binary dependent variable, which makes it appropriate here because the outcome is binary: positive or negative.

Proportion of positive to negative words by president based on the bing lexicon
Results from applying logistic regression
Term         Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  0         1.0694006   0        1
date         0         0.0000745   0        1

From the plot, it appears that negative sentiment has generally declined over the years.

However, the logistic regression does not support a statistically significant trend: the estimated coefficient for date is 0 and its p-value is 1, well above the traditional threshold of 0.05. There is therefore no evidence of a significant decrease in the proportion of negative sentiment over time.
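For readers unfamiliar with the technique, the sketch below shows how a logistic regression of this kind can be fitted with base R's `glm()`. It runs on simulated data, with one row per sentiment-bearing word and no time trend built in; the variable names `year` and `negative` and the data layout are assumptions, and the output is not the result reported in the table above.

```r
set.seed(42)

# Simulated data: one row per sentiment-bearing word (an assumed layout).
n <- 500
year <- sample(1994:2023, n, replace = TRUE)

# Outcome is 1 for a negative word and 0 for a positive word.
# The probability is constant, so no time trend is built in.
negative <- rbinom(n, size = 1, prob = 0.35)

# Logistic regression of the binary outcome on time.
fit <- glm(negative ~ year, family = binomial)

# Coefficient table: estimate, standard error, z value and p-value.
coefs <- summary(fit)$coefficients
print(coefs)

# p-value for the time trend; compare against the 0.05 threshold.
slope_p <- coefs["year", "Pr(>|z|)"]
```

A small p-value for the slope would indicate that the odds of a word being negative change with time; a large one, as found in this project, gives no evidence of a trend.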

3.1.3 Bigrams Analysis

An n-gram is a sequence of n words in a text. Bigrams, in particular, are pairs of words that are adjacent to each other. Certain words, such as “not”, change the meaning of the subsequent word by negating it. It is therefore important to consider this when conducting sentiment analysis.

This subsection explores the most common positive and negative bigrams in the speeches using the bing dictionary. To determine the net sentiment of a bigram, positive words were assigned a value of 1 and negative words a value of -1. Words preceded by a negation word (‘not’, ‘no’, ‘never’, ‘without’ or ‘anti’) had their sentiment reversed. During the analysis, a few of the labels given to words in the bing dictionary did not align with this context and were thus adjusted: ‘honour’ and ‘hope’ were changed from neutral to positive, and ‘oppression’ was reclassified from neutral to negative.
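The negation rule can be illustrated with the base R sketch below. It is a simplified illustration and not the code used for this report: the small `lexicon` vector and the example `bigrams` are made up, and scoring only the second word of each bigram is an assumption about how the net sentiment was computed.

```r
# Toy word scores standing in for the bing labels (+1 positive, -1 negative).
lexicon <- c(enough = 1, improve = 1, progress = 1, poverty = -1, crime = -1)

# Negation words as listed in the text.
negation_words <- c("not", "no", "never", "without", "anti")

# Made-up bigrams, already split into first and second word.
bigrams <- data.frame(
  word1 = c("not", "will", "no", "reduce"),
  word2 = c("enough", "improve", "progress", "poverty"),
  stringsAsFactors = FALSE
)

# Score of the second word; words missing from the lexicon score 0 (neutral).
score <- unname(lexicon[bigrams$word2])
score[is.na(score)] <- 0

# Reverse the sign when the first word is a negation word.
negated <- bigrams$word1 %in% negation_words
bigrams$net_sentiment <- ifelse(negated, -score, score)
print(bigrams)
```

Here ‘not enough’ and ‘no progress’ end up negative even though ‘enough’ and ‘progress’ are positive on their own, which is the effect the reversal is meant to capture.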

The findings illustrated in the plots partially align with what was observed in the top 20 positive and negative words. In the case of positive words and bigrams, several words appeared in both sets, such as improve, support, and progress. For the negative bigrams, the main themes were similar as before: poverty, crime and corruption. Interestingly, none of the negative bigrams contained negation words; these were subsequently extracted separately.

The most frequently occurring negation words in the bigrams were ‘not’ and ‘no’. Some sentences that include the “not enough” bigram are:

  • “But the response is that not enough criminals are being arrested and the quality of investigation is poor.”

  • “Not enough jobs are being created.”

  • “While structural reforms are necessary for us to revive economic growth, they are not enough on their own.”

These examples highlight the importance of taking negation words into account as they reverse the sentiments of the sentences.

3.2 Latent Dirichlet Allocation

3.2.1 Analysis of All Presidents

As mentioned earlier, 9 topics were produced from the data. The 20 most used words in each topic can be seen in the figure below.

The 20 Most Used Terms per Topic

While there is still some overlap among the topics, the general idea of each is fairly clear. The first topic seems to focus on a new democratic South Africa and the challenges this brings in improving the position of formerly oppressed groups. Topic two includes the words ‘crime’, ‘security’, ‘improve’ and ‘progress’, among many others, which touch on South Africa’s crime issues and the need for progress, as violent crime has been prevalent in South Africa since the end of Apartheid. The third topic focuses on ‘energy’, ‘infrastructure’, ‘health’, ‘water’, ‘mining’ and ‘women’. This may be a nod to South Africa’s water and energy infrastructure and its development, as well as the need to invest more in this infrastructure to ensure consistent supply. ‘Women’ and ‘children’ also appear, which may indicate the vulnerability of these groups to the aforementioned issues or the need to include women in these sectors. Topic four looks at ‘water’ and ‘infrastructure’ again but also mentions ‘cape’, ‘communities’ and ‘rural’, which may refer to the Cape Town drought of 2015 to 2018, during which rural areas were badly affected. Topic five looks at resource issues again, but with a focus on the economy and businesses and on providing support to them, as the following words are mentioned: ‘water’, ‘crisis’, ‘challenges’, ‘electricity’, ‘businesses’, ‘employment’, ‘support’ and ‘companies’.

Topic six mentions ‘women’, ‘crisis’, ‘hope’ and ‘improve’, which may refer to the gender-based violence issues in South Africa, which disproportionately affect women. Topic seven refers to the transition from an Apartheid South Africa to a democratic South Africa. Topic eight contains ‘billions’, ‘energy’, ‘support’, ‘commission’ and ‘corruption’, all of which may refer to South Africa’s high corruption levels, especially in the energy sector and more specifically at Eskom. Lastly, the ninth topic seems to focus on ‘service’ and ‘poverty’ ‘issues’ that need to be addressed for the ‘black’ majority of South Africa, who have been disproportionately affected by poor service delivery and poverty owing to the ramifications of Apartheid laws and a lack of initiative to uplift them since then.

Overall, it was found that, per topic, Mandela and Ramaphosa appeared 7 times, Mbeki and Zuma 10 times, and Motlanthe and de Klerk once each. This shows each president’s contribution to the topics.

3.2.2 Analysis of Individual Presidents

Next, topic modelling was conducted for each president. While two topics were produced for each president, there was some overlap for those with a lower number of words. Going chronologically, we start with de Klerk. His first topic focused on words such as ‘freedom’, ‘constitution’, ‘peaceful’ and ‘support’ as well as ‘zulu’, which, true to the climate at the time of the speech, refer to the transition to a democratic and free South Africa. ‘Zulu’ is mentioned because the constitution ensured that Zulu leaders would be given a certain amount of political power out of respect for the Zulu Kingdom. His second topic mentions ‘election’ and ‘future’, as it is more focused on the future of South Africa and the way forward.

Topic Modelling for de Klerk

Mandela’s first topic mentions ‘progress’, ‘past’, ‘hope’, ‘improve’, ‘time’, ‘building’ and ‘democracy’, all of which relate to improving the lives of previously marginalised groups through a new, democratic state. His second topic focuses on many issues South Africa still faces, such as crime, women’s rights and security. Police and reconstruction are also mentioned in this topic. It may be related to the social issues South Africa was facing and potential methods of change.

Topic Modelling for Mandela

Mbeki’s first topic addresses the need to improve services and infrastructure for local communities, with some mention of international resources. The second topic looks at some of South Africa’s issues, such as ‘poverty’ and ‘crime’, together with words that may relate to ways of addressing these issues. ‘Water’ is also mentioned in this topic, which may be an additional issue being addressed.

Topic Modelling for Mbeki

As Motlanthe gave only one speech, his topics are not very well defined. The first looks at issues related to South Africa as well as ways of improving them, while the second speaks to many things, such as ‘democracy’, ‘poverty’ and ‘children’.

Topic Modelling for Motlanthe

On the topic of water, Zuma’s first topic mentions both ‘water’ and ‘cape’, which may indicate a reference to the Cape Town drought. It also mentions ‘business’, ‘health’ and ‘education’, all of which were sectors affected by this drought. The topic also mentions ‘job’ and ‘jobs’, both of which may refer to unemployment rates and promises of job creation. The second topic looks at resources such as ‘electricity’, ‘water’ and ‘energy’, and also includes words such as ‘support’ and ‘women’. This topic may refer to the consistent supply of these resources and ways to ensure it, while also focusing on support for women, especially in light of gender-based violence issues.

Topic Modelling for Zuma

Lastly, the first topic of the current president, Ramaphosa, relates to ‘energy’, ‘eskom’ and ‘challenges’, which is a nod to the energy issues South Africa is facing in relation to Eskom. ‘Investment’ and related words are mentioned, implying that this topic also includes the measures that are currently in place. ‘Health’ is also mentioned, as well as ‘capacity’, which may refer to the Covid-19 pandemic. The second topic mentions ‘corruption’, ‘electricity’, ‘investment’ and related words, which may be a nod to corruption in the energy sector, especially at Eskom.

Topic Modelling for Ramaphosa

4 Conclusions

A sentiment analysis of the State of the Nation Addresses was conducted to discern the overall feelings expressed by the South African presidents over time. It was discovered that most of the presidents, with the exception of de Klerk, had common interests: peace, freedom and progress. Recurring concerns centred on poverty, crime and corruption. The analysis indicated that concerns over these themes did not change over time, as the logistic regression model showed no significant change in sentiment over time.

In the LDA, the topics found for each president summarised the political climate of their times well and gave insight into which issues were at hand and how they progressed. Overall, the topics progress from words related to democracy and freedom to economic and development jargon, which was used by almost all the presidents, with a nod to non-economic and social issues such as crime, poverty, women or water throughout the speeches.

Lastly, in terms of limitations and scope for future improvement, some of the presidents were under-represented, which limited the validity of the analysis of their sentiments and topics. There were also hindrances due to the subjective nature of lexicons, as many words may have been classified in ways that do not suit this context. With the LDA, the topics were interpreted manually, so the interpretations are subject to the researcher’s judgement. For future work, researchers may want to balance their dataset for a more representative analysis, and automatic keyword extraction may be applied to classify the topics in real time.

5 Author Contributions

Author Contributions to This Project
Section Contributors
Sentiment Analysis Susana Maganga
LDA Puja Pande
Trends Over Time for LDA and Sentiment Analysis Ndivhuwo Nyase
EDA Puja Pande
Use of Large Language Models Puja Pande and Susana Maganga
Report Write-Up (All Sections) Ndivhuwo Nyase, Susana Maganga, Puja Pande (All Authors)

6 References

Irawan, H., Akmalia, G. and Masrury, R.A., 2019, September. Mining tourist’s perception toward Indonesia tourism destination using sentiment analysis and topic modelling. In Proceedings of the 2019 4th International Conference on Cloud Computing and Internet of Things (pp. 7-12).

Kulshrestha, R. (2020). Latent Dirichlet Allocation(LDA). [online] Medium. Available at: https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2. [Accessed 17 October 2023].

Miranda, J.P.P. and Bringula, R.P., 2021. Exploring Philippine Presidents’ speeches: A sentiment analysis and topic modeling approach. Cogent Social Sciences, 7(1), p.1932030.

Sadia, A., Khan, F.K., & Bashir, F. (2018). An Overview of Lexicon-Based Approach For Sentiment Analysis.

Zhang, Z. (n.d.). Text Mining for Social and Behavioral Research Using R. [online] books.psychstat.org. Available at: https://books.psychstat.org/textmining/topic-models.html [Accessed 17 October 2023].